We develop, analyze and experiment with a new tool, called MADMX, which extracts frequent motifs, 
possibly including don't care characters, from biological sequences. We introduce density, a simple and 
flexible measure for bounding the number of don't cares in a motif, defined as the ratio of solid (i.e., 
different from don't care) characters to the total length of the motif. By extracting only maximal dense 
motifs, MADMX reduces the output size and improves performance, while enhancing the quality of the 
discoveries. The efficiency of our approach relies on a newly defined combining operation, dubbed fusion, 
which allows for the construction of maximal dense motifs in a bottom-up fashion, while avoiding the 
generation of nonmaximal ones. We provide experimental evidence of the efficiency and the quality of 
the motifs returned by madmx. 

1 Introduction 

The discovery of frequent patterns (motifs) in biological sequences has attracted wide interest in recent years, 
due to the understanding that sequence similarity is often a necessary condition for functional correlation. 
Among other applications, motif discovery proves an important tool for identifying regulatory regions and 
binding sites in the study of functional genomics. From a computational point of view, a major complication 
for the discovery of motifs is that they may feature some sequence variation without loss of function. The 
discovery process must therefore target approximate motifs, whose occurrences are similar but not necessarily 
identical. Approximate motifs are often modeled through the use of the don't care character in certain 
positions, which is a wild card matching all characters of the alphabet, called solid characters [10]. 

Finding interesting approximate motifs is computationally challenging. As the number of don't cares 
increases and/or the minimum frequency threshold decreases, the output may explode combinatorially, even 
if the discovery targets only maximal motifs — a subset of the motifs which implicitly represents the complete 
set. Moreover, even when the final output is not too large, partial data during the inference of target motifs 
might lead to memory saturation or to extensive computation during the intermediate steps. 

A large body of literature in the last decade has dealt with efficient motif discovery [SI |31 HH IH HH 
IHl [SI E H] , and an excellent survey of known results can be found in the book [10]. In order to alleviate 
the computational burden of motif extraction and to limit the output to the most promising or interesting 
discoveries, some works combine the traditional use of a frequency threshold with restrictions on the flexibility 
of the extracted motifs, often captured by limitations on the number of occurring don't cares. 

In a recent work, Apostolico et al. [2] study the extraction of extensible motifs, comprising standard 
don't cares and extensible wild cards. The latter are spacers of variable length that can take different size 
(within pre-specified limits) in each occurrence of the motif. An efficient tool, called VARUN, is devised in 
[2] for extracting all maximal extensible motifs (according to a suitable notion of maximality defined in the 
paper) which occur with frequency above a given threshold $\sigma$ and with upper limits D on the length of 
the spacers. VARUN returns the extracted motifs sorted by decreasing z-score, a widely adopted statistical 
measure of interestingness. The authors demonstrate the effectiveness of their approach both theoretically, 
by proving that each maximal motif features the highest z-score within the class of motifs it represents, and 
experimentally, by showing that the returned top-scored motifs comprise biologically relevant ones when run 
on protein families and DNA sequences. 

A slightly more general way of limiting the number of don't cares in a motif has been explored in [13]. 
The authors define (L, W) motifs, for $L \le W$, where at least L solid characters must occur in each substring 
of length W of the motif. They propose a strategy for extracting (L, W) motifs which are also maximal, 
although their notion of maximality is not internal to the class of (L, W) motifs. As a consequence, the 
algorithm is not complete, since it disregards all those (L, W) motifs that are subsumed by a maximal 
non-(L, W) one. 

Our results. Our work focuses on the discovery of rigid motifs, which contain blocks of solid characters 
(solid blocks) separated by one or more don't cares. We propose a more general approach for controlling the 
number of don't cares in rigid motifs. Specifically, we introduce the notion of dense motif, a frequent pattern 
where the fraction of solid characters is above a given threshold. Our density notion is more flexible and 
general than the one considered in [TOj [2] , since it allows for arbitrarily long runs of don't cares as long as the 
fraction of solid characters in the pattern is above the threshold. We define a natural notion of maximality 
for dense patterns and devise an efficient algorithm, called madmx (pronounced Mad Max), which performs 
complete MAximal Dense Motif extraction from an input sequence, with respect to user-specified frequency 
and density thresholds. 

The key technical result at the core of our extraction strategy is a closure property which affords the 
complete generation of all maximal dense motifs in a breadth-first fashion, through an apriori-like strategy 
[1], starting from a relatively small set of solid blocks, and then repeatedly applying a suitable combining 
operator, called fusion, to pairs of previously generated motifs. In this fashion, our strategy avoids the 
generation and consequent storage of intermediate patterns which are not in the output set, which ensures 
time and space complexities polynomial in the combined size of the input and the output. 

We performed a number of experiments on madmx to assess the biological significance of maximal dense 
motifs and to compare madmx against its most recent and close competitor varun. For the first objective, 
we used madmx to extract maximal dense motifs from a number of human dna fragments. We compared the 
output set against those in RepBase [7], the largest repository of repetitive patterns for eukaryotic species, 
using REPEATMASKER [15], a popular tool for masking repetitive DNA. The experiments show that all of 
our returned motifs are occurrences of patterns in RepBase, and fully characterize the family of SINE/ALU 
repeats (and partially the LINE/L1 family). This provides evidence that the notion of density, when applied 
to rigid motifs, captures biological significance. 

Next we compared the z-score performance of madmx and varun. We ran both algorithms on several 
families of dna fragments, limiting varun to the generation of rigid motifs and setting the parameters 
so as to obtain comparable output sizes, with motifs listed by decreasing z-score. The experiments show 
that the top-m highest-ranking motifs returned by MADMX almost always feature higher z-scores than the 
corresponding top-m ones returned by varun, even for large values of m, with only a modest increase in 
running time, which may be partly due to the fact that coding of MADMX is yet to be optimized. In fairness, 
we must remark that VARUN deals also with extensible motifs while madmx only targets rigid motifs. 

The paper is organized as follows. In Section 2 several technical definitions and properties of motifs with 
don't cares are given. Section 3 proves the closure property at the base of MADMX and provides a high-level 
description of the algorithm. In Section 4 the experimental validation of MADMX is presented. 



2 Preliminary Definitions and Properties 

Let $\Sigma$ be an alphabet of m characters and let $s = s[0]s[1] \ldots s[n-1]$ be a string of length n over $\Sigma$. We 
use $s[i \ldots j]$ to denote the substring $s[i]s[i+1] \cdots s[j]$ of s, for $i \le j$. Characters in $\Sigma$ are also called solid 
characters. We use $\circ \notin \Sigma$ to denote a distinguished character called wild card or don't care character. Let $\varepsilon$ 
denote the empty string. A pattern x is a string in $\{\varepsilon\} \cup \Sigma \cup \Sigma(\Sigma \cup \{\circ\})^{*}\Sigma$. However, whenever necessary, 
we will assume that patterns are implicitly padded to their left and right with arbitrary sequences of don't 
care characters. 

Given two patterns x, y we say that y is more specific than x, and write $x \preceq y$, iff for every $i \ge 0$ either 
$x[i] = y[i]$ or $x[i] = \circ$. Given two patterns x, y we say that x occurs in y at position $\ell$ iff $x \preceq y[\ell \ldots \ell + |x| - 1]$; 
we also say that y contains x. For a string s, the location list $\mathcal{L}_x$ of a pattern x in s is the complete set 
of positions at which x occurs in s. We refer to $f(x) = |\mathcal{L}_x|$ as the frequency of pattern x in s. (Note 
that $f(\varepsilon) = n$.) As in [16], the translated representation of the location list $\mathcal{L}_x = \{l_0, l_1, l_2, \ldots, l_k\}$ is 
$\tau(\mathcal{L}_x) = \{l_1 - l_0, l_2 - l_0, \ldots, l_k - l_0\}$. Given two patterns x, y, we say that y subsumes x in s if $f(x) = f(y)$ 
and y contains x. As a consequence, if y subsumes x then $\tau(\mathcal{L}_x) = \tau(\mathcal{L}_y)$. A pattern x is maximal if it is 
not subsumed by any other pattern y. (We observe that this notion of maximality coincides with that of 
[12].) Given a pattern x, its maximal extension $\mathcal{M}(x)$ is the maximal pattern that subsumes x, which can 
be shown to be unique [12]. 
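
The notions of occurrence, location list, frequency and translated representation can be illustrated with a small Python sketch. It is only an illustration, not the implementation discussed in this paper: the character "." is an assumed stand-in for the don't care symbol, the function names are hypothetical, and the naive scan handles nonempty patterns only.

```python
DONT_CARE = "."  # assumed stand-in for the don't care character


def occurs_at(pattern, s, pos):
    """Return True if `pattern` occurs in the string `s` at position `pos`."""
    if pos < 0 or pos + len(pattern) > len(s):
        return False
    window = s[pos:pos + len(pattern)]
    return all(p == DONT_CARE or p == c for p, c in zip(pattern, window))


def location_list(pattern, s):
    """All positions of `s` at which the (nonempty) pattern occurs."""
    return [pos for pos in range(len(s) - len(pattern) + 1)
            if occurs_at(pattern, s, pos)]


def frequency(pattern, s):
    """Frequency of the pattern, i.e. the size of its location list."""
    return len(location_list(pattern, s))


def translated(locations):
    """Translated representation: offsets relative to the first occurrence."""
    return [l - locations[0] for l in locations[1:]]


s = "AdBeCfAgBhC"
print(location_list("A.B", s), translated(location_list("A.B", s)))
print(location_list("B", s), translated(location_list("B", s)))
```

In this toy string the patterns "B" and "A.B" have the same frequency and the same translated representation, which is the situation in which the longer pattern subsumes the shorter one.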



In what follows, we call solid block a string in $\Sigma^{+}$ and a don't care block a string in $\circ^{+}$. Furthermore, 
given a pattern x, dc(x) denotes the number of don't care characters contained in x. 

Definition 1. The density $\delta(x)$ of x is: $\delta(x) = 1 - \mathrm{dc}(x)/|x|$. Given a (density) threshold $\rho$, $0 < \rho \le 1$, we 
say that a pattern x is dense if $\delta(x) \ge \rho$. 

Note that a solid block is a dense pattern with respect to every threshold p. 
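
As a small illustration of the density measure, the following Python sketch computes the ratio with exact fractions, so that comparisons against a threshold such as 2/3 are not affected by rounding. The character "." is an assumed stand-in for the don't care symbol and the function names are hypothetical.

```python
from fractions import Fraction

DONT_CARE = "."  # assumed stand-in for the don't care character


def density(pattern):
    """Density of a nonempty pattern: 1 - dc(x)/|x|, as an exact fraction."""
    dc = pattern.count(DONT_CARE)
    return 1 - Fraction(dc, len(pattern))


def is_dense(pattern, rho):
    """A pattern is dense if its density is at least the threshold rho."""
    return density(pattern) >= rho


print(density("A.B"), is_dense("A.B", Fraction(2, 3)))
print(density("A.B.C"), is_dense("A.B.C", Fraction(2, 3)))
```

A solid block contains no don't cares, so its density is 1 and it passes the test for any admissible threshold.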

It is reasonable to concentrate the attention on dense patterns that are not subsumed by any other 
dense pattern, since they are the most interesting dense representatives in the equivalence classes induced 
by "sharing" the same translated representation; these representatives are defined below. 

Definition 2. A dense pattern x is a maximal dense pattern in s if it is not subsumed by any other dense 
pattern $x' \ne x$. 

Observe that a maximal dense pattern x need not be a maximal pattern in the general sense, since $\mathcal{M}(x)$ 
might be a nondense pattern. However, every dense pattern x is subsumed by at least one maximal dense 
pattern. In fact, all of the maximal dense patterns that subsume x are dense substrings of $\mathcal{M}(x)$, namely, 
those that contain x and are not substrings of any other dense substring of $\mathcal{M}(x)$. We want to stress that 
there might be several maximal dense patterns that subsume x. As an example, for $\rho = 2/3$, the dense 
pattern x = B in the string s = AdBeCfAgBhC is subsumed by maximal dense patterns $A \circ B$ and $B \circ C$, while 
$\mathcal{M}(x) = A \circ B \circ C$ is not dense. 
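
The example can be checked mechanically. The sketch below enumerates the dense substrings of a given pattern that begin and end with a solid character and keeps those not contained in a longer dense substring. It is a naive illustration under stated assumptions: "." stands for the don't care symbol, the function name is hypothetical, and the additional requirement that a substring contains a specific pattern x is not enforced here.

```python
from fractions import Fraction

DONT_CARE = "."  # assumed stand-in for the don't care character


def maximal_dense_substrings(pattern, rho):
    """Dense substrings of `pattern` not contained in a longer dense substring."""
    n = len(pattern)
    dense = []
    for a in range(n):
        if pattern[a] == DONT_CARE:
            continue  # a pattern must start with a solid character
        solid = 0
        for b in range(a, n):
            if pattern[b] == DONT_CARE:
                continue  # a pattern must end with a solid character
            solid += 1
            if Fraction(solid, b - a + 1) >= rho:
                dense.append((a, b))
    # keep only the intervals that are maximal with respect to containment
    maximal = [(a, b) for (a, b) in dense
               if not any((c, d) != (a, b) and c <= a and b <= d
                          for (c, d) in dense)]
    return [pattern[a:b + 1] for (a, b) in maximal]


print(maximal_dense_substrings("A.B.C", Fraction(2, 3)))
```

For the pattern "A.B.C" and threshold 2/3 the two surviving substrings are "A.B" and "B.C", matching the example above.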

Definition 3. Given a frequency threshold $\sigma$ and a density threshold $\rho$, a pattern x is a dense maximal 
motif in s if x is a maximal dense pattern in s with respect to $\rho$, and $f(x) \ge \sigma$. A dense maximal motif for 
$\rho = 1$ is also referred to as maximal solid block. 

Problem of interest. We are given an input string s, a frequency threshold $\sigma$, and a density threshold $\rho$. 
Find all the maximal dense motifs in s. 

In the rest of the paper, we will omit referencing the input string s when clear from the context. An 
important property of maximal dense patterns, which we will exploit in our mining strategy, is that all of 
their solid blocks are maximal solid blocks. This property is stated in the following proposition whose proof, 
omitted for brevity, extends a similar result holding for arbitrary maximal patterns [16, 11]. 

Proposition 1. Let x be a maximal dense pattern with respect to a density threshold $\rho$, and let $b = x[i \ldots j]$ 
be a solid block in x such that $x[i-1] = x[j+1] = \circ$ and $j \ge i$. Then, b is a maximal solid block. 

3 An Algorithm for MAximal Dense Motif extraction 

In this section we describe our algorithm, called MADMX (pronounced Mad Max), for MAximal Dense Motif 
extraction. The algorithm adopts a breadth-first apriori-like strategy [1], similar in spirit to the one 
developed in [2], using maximal solid blocks as building blocks by Proposition 1. MADMX operates by repeatedly 
combining together, in a suitable fashion, pairs of maximal dense motifs, and extracting from the 
combinations less frequent maximal dense motifs. 

A key notion for the algorithm, underlying the aforementioned combining operations, is the fusion of 
characters/patterns. 

Definition 4. Given three characters $c, c_1, c_2 \in \Sigma \cup \{\circ\}$, we say that c is the fusion of $c_1$ and $c_2$, and write 
$c = c_1 \vee c_2$, if one of the following holds: 

1. $c = c_1 = c_2$; 

2. $c_1 = \circ$, $c = c_2 \ne \circ$; 

3. $c = c_1 \ne \circ$, $c_2 = \circ$. 

The above notion of fusion generalizes to patterns as follows. 

Definition 5. Given three patterns x, y, z and an integer d, we say that z is the d-fusion of x and y, and 
write $z = x \vee_d y$, if z can be obtained by removing the leading and trailing don't care characters from the 
pattern m defined as $m[i] = x[i+d] \vee y[i]$, for all indices i. 
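
A minimal Python sketch of character fusion and d-fusion follows. It is an illustration only: "." is an assumed stand-in for the don't care symbol, the function names are hypothetical, positions outside a pattern are treated as don't cares (the implicit padding of Section 2), and `None` signals that the fusion does not exist.

```python
DONT_CARE = "."  # assumed stand-in for the don't care character


def fuse_chars(c1, c2):
    """Fusion of two characters, or None when two distinct solid characters clash."""
    if c1 == c2:
        return c1
    if c1 == DONT_CARE:
        return c2
    if c2 == DONT_CARE:
        return c1
    return None


def char_at(pattern, i):
    """Character at index i, with implicit don't care padding on both sides."""
    return pattern[i] if 0 <= i < len(pattern) else DONT_CARE


def d_fusion(x, y, d):
    """d-fusion of x and y: m[i] = x[i + d] fused with y[i], then trimmed."""
    lo = min(0, -d)                  # first index where x or y is defined
    hi = max(len(y), len(x) - d)     # one past the last such index
    fused = []
    for i in range(lo, hi):
        c = fuse_chars(char_at(x, i + d), char_at(y, i))
        if c is None:
            return None              # the fusion is not defined for this d
        fused.append(c)
    return "".join(fused).strip(DONT_CARE)


print(d_fusion("A.B", "B.C", 2))   # the two patterns overlap on B
print(d_fusion("A.B", "B.C", 0))   # A clashes with B, so no fusion exists
```

With d = 2 the last character of "A.B" is aligned with the first character of "B.C", and the result is "A.B.C"; with d = 0 the fusion is undefined.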

The breadth-first strategy adopted by our algorithm crucially relies on the following theorem, which 
highlights the structure of dense motifs: 

Theorem 1. Let x be a maximal dense motif with dc(x) > 0. Then: 

(a) there exists a maximal solid block b in x such that $\mathcal{M}(x) = \mathcal{M}(b)$, or 

(b) there exist two maximal dense motifs $y_1, y_2$ such that: 

• $\mathcal{M}(x) = \mathcal{M}(y_1 \vee_d y_2)$, for some d; 

• there are two maximal solid blocks $b_1, b_2$ in x and an integer $d > 0$ such that $b_1$ is a maximal solid 
block in $y_1$, $b_2$ is a maximal solid block in $y_2$, and $b_1 \circ^{d} b_2$ is contained in $y_1 \vee_d y_2$; 

• $f(x) < \min\{f(y_1), f(y_2)\}$. 

For the proof of Theorem 1 we need to define another type of pattern combination, namely the operation 
of merge between two patterns, which is similar to the one introduced in [12]. Given two characters $c_1, c_2$, 
we define the operator $\oplus$ between them such that $c_1 \oplus c_2 = \circ$, if $c_1 \ne c_2$, and $c_1 \oplus c_2 = c_1 = c_2$, otherwise. 

Definition 6. Given two patterns x, y and an integer d, the d-merge of x and y is the pattern $z = x \oplus_d y$ 
which can be obtained by removing all leading and trailing don't cares from the pattern m defined as 

$m[i] = x[i+d] \oplus y[i]$ for all i. 

We want to stress the difference between the notions of merging and fusion: the merge of two patterns 
x, y is always well defined and more general than x, y, while the fusion of x, y may not exist and, if it does, 
is more specific than x, y. 
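
The contrast with fusion can be seen in a short Python sketch of the d-merge. As before this is only an illustration: "." is an assumed stand-in for the don't care symbol, the function names are hypothetical, and positions outside a pattern are treated as don't cares.

```python
DONT_CARE = "."  # assumed stand-in for the don't care character


def merge_chars(c1, c2):
    """Merge of two characters: kept when equal, a don't care otherwise."""
    return c1 if c1 == c2 else DONT_CARE


def char_at(pattern, i):
    """Character at index i, with implicit don't care padding on both sides."""
    return pattern[i] if 0 <= i < len(pattern) else DONT_CARE


def d_merge(x, y, d):
    """d-merge of x and y: m[i] = x[i + d] merged with y[i], then trimmed."""
    lo = min(0, -d)
    hi = max(len(y), len(x) - d)
    merged = [merge_chars(char_at(x, i + d), char_at(y, i))
              for i in range(lo, hi)]
    return "".join(merged).strip(DONT_CARE)


print(d_merge("ACGT", "AGGT", 0))  # positions that disagree become don't cares
print(d_merge("A.B", "B.C", 2))    # only the shared B survives
```

Unlike the fusion, the merge always returns a pattern, possibly the empty string when the two patterns agree on no solid character.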

For the proof of Theorem 1 we also need the property established by the following lemma. 

Lemma 1. Let x and y be maximal patterns, and d be an integer such that $z = x \oplus_d y \ne \varepsilon$. Then z is a 
maximal pattern. Moreover, if $z \ne x$ (resp., $z \ne y$) then $f(z) > f(x)$ (resp., $f(z) > f(y)$). 

Proof. First we prove that z is maximal. By contradiction, suppose that this is not the case. Then, there 
exists a position i such that $z[i] = \circ$ and we can replace the $\circ$ with a solid character c without decreasing 
the frequency of the pattern. (Note that the position of the substitution can be to the left of the first 
character in z or to the right of the last character in z.) Since x and y are more specific than z, to every 
occurrence of x and y in the string corresponds an occurrence of z. Hence, every occurrence of x (resp., y) 
in the string contains c in its (i + d)-th (resp., i-th) position. Therefore, by maximality of x and y, it must be 
$z[i] = x[i+d] = y[i] = c$, which is a contradiction. The relations between the frequencies of x, y and z follow 
trivially by their maximality. □ 

We are now ready to prove the theorem. 

Proof of Theorem 1. Given a pattern x and two nonnegative integers $i \le j$, we let $x^{*}[i \ldots j]$ denote the pattern 
obtained by removing all the leading and trailing don't care characters from $x[i \ldots j]$. Since x is a maximal 
dense pattern and dc(x) > 0, it is easy to see that there exist two dense patterns $x_1, x_2$ and an integer $d > 0$ 
such that $x = x_1 \circ^{d} x_2$, hence there exists an index $s_1 > 0$ such that $x^{*}[0 \ldots s_1 - 1]$ and $x^{*}[s_1 + 1 \ldots |x| - 1]$ 
are dense. We call these two patterns the level-1 decomposition of x (observe that many such decompositions 
may exist). Also, we let $\ell_1 = 0$ and $r_1 = |x| - 1$. Now, consider the following iterative process: 

1. If in the level-i decomposition of x both $x^{*}[\ell_i \ldots s_i - 1]$ and $x^{*}[s_i + 1 \ldots r_i]$ have frequency strictly 
greater than f(x), or at least one of $x^{*}[\ell_i \ldots s_i - 1]$ and $x^{*}[s_i + 1 \ldots r_i]$ is a solid block with frequency 
equal to f(x), then terminate; 

2. Otherwise, let $y = x^{*}[\ell_{i+1} \ldots r_{i+1}]$ be (an arbitrary) one of $x^{*}[\ell_i \ldots s_i - 1]$ or $x^{*}[s_i + 1 \ldots r_i]$ which 
is not a solid block and has frequency equal to f(x). Since y is dense, there exists an index $s_{i+1}$, 
$\ell_{i+1} < s_{i+1} < r_{i+1}$, such that $x^{*}[\ell_{i+1} \ldots s_{i+1} - 1]$ and $x^{*}[s_{i+1} + 1 \ldots r_{i+1}]$ are both dense. Call these 
two patterns the level-(i + 1) decomposition of x. Set i = i + 1 and go to Step 1. 

Assume that the decomposition process ends by finding a solid block b that is a solid block in x and 
has $f(b) = f(x)$. Then, $\mathcal{M}(b) = \mathcal{M}(x)$ and the theorem follows. Otherwise, at the last level j of the 
decomposition, we have that $f(x) < \min\{f(x^{*}[\ell_j \ldots s_j - 1]), f(x^{*}[s_j + 1 \ldots r_j])\}$. In this latter case, as 
explained in Section 2 (after Definition 2), we can determine two maximal dense patterns $y_1, y_2$ such that $y_1$ 
contains $x^{*}[\ell_j \ldots s_j - 1]$, $y_2$ contains $x^{*}[s_j + 1 \ldots r_j]$, and with $\mathcal{M}(y_1) = \mathcal{M}(x^{*}[\ell_j \ldots s_j - 1])$ and $\mathcal{M}(y_2) = 
\mathcal{M}(x^{*}[s_j + 1 \ldots r_j])$. Since $f(y_1) = f(x^{*}[\ell_j \ldots s_j - 1])$ and $f(y_2) = f(x^{*}[s_j + 1 \ldots r_j])$, we have that $f(x) < 
\min\{f(y_1), f(y_2)\}$. Observe that by construction there must exist two solid blocks $b_1, b_2$ in x and an integer 
d such that $b_1$ is a solid block in $y_1$, $b_2$ is a solid block in $y_2$, and $b_1 \circ^{d} b_2$ is a sequence of two solid blocks in 
x. In fact, $b_1$ (resp., $b_2$) is the last (resp., the first) solid block of $x^{*}[\ell_j \ldots s_j - 1]$ (resp., $x^{*}[s_j + 1 \ldots r_j]$). 

Next, we show that there exists a d such that the d-fusion $y_1 \vee_d y_2$ is well defined, contains $b_1 \circ^{d} b_2$, and 
$\mathcal{M}(y_1 \vee_d y_2) = \mathcal{M}(x)$. We proceed as follows. Let us "align" $\mathcal{M}(x)$ and $y_1$ so as to match the occurrences 
of $b_1$ in both patterns. Then, for a certain integer p, $\mathcal{M}(x)[i + p]$ corresponds to $y_1[i]$. Assume, for the 
sake of contradiction, that there exists an index j such that $\mathcal{M}(x)[j + p]$ is not more specific than $y_1[j]$. 
Then, Lemma 1 implies that $z = \mathcal{M}(x) \oplus_p \mathcal{M}(y_1) \ne \mathcal{M}(y_1)$, which contains $x^{*}[\ell_j \ldots s_j - 1]$, is maximal 
and has frequency strictly greater than $f(y_1)$, which is impossible because we have chosen $y_1$ such that 
$\mathcal{M}(x^{*}[\ell_j \ldots s_j - 1]) = \mathcal{M}(y_1)$ and therefore $f(x^{*}[\ell_j \ldots s_j - 1]) = f(y_1)$. Therefore, $\mathcal{M}(x)$ contains $y_1$. A 
similar argument shows that $\mathcal{M}(x)$ contains $y_2$. 

Since $y_1$ and $y_2$ are contained in $\mathcal{M}(x)$, there must exist a d such that $y_1 \vee_d y_2$ is well defined and can 
be aligned with $\mathcal{M}(x)$ in such a way as to match the blocks $b_1$ and $b_2$ of $y_1$ and $y_2$ with the corresponding 
blocks in $\mathcal{M}(x)$. Moreover, $\mathcal{M}(x)$ contains $y_1 \vee_d y_2$, hence $f(y_1 \vee_d y_2) \ge f(\mathcal{M}(x)) = f(x)$. However, 
since $y_1 \vee_d y_2$ contains both $x^{*}[\ell_j \ldots s_j - 1]$ and $x^{*}[s_j + 1 \ldots r_j]$, it contains also $x^{*}[\ell_j \ldots r_j]$, which, by the 
decomposition process, has frequency equal to f(x). Therefore, $f(y_1 \vee_d y_2) \le f(x)$, and the theorem follows 
since $f(y_1 \vee_d y_2) = f(x)$. □ 
Algorithm 1: MADMX 
Input: String s, frequency threshold $\sigma$, density threshold $\rho$ 
Output: Maximal dense motifs 

1  previous $\leftarrow \emptyset$, current $\leftarrow \emptyset$, next $\leftarrow \emptyset$; 
2  blocks $\leftarrow$ maximal solid blocks of s with frequency $\ge \sigma$; 
3  for each $b \in$ blocks do 
4      find $\mathcal{M}(b)$; 
5      DM $\leftarrow$ extractMaximalDense($\mathcal{M}(b)$); 
6      for each $x \in$ DM do current $\leftarrow$ current $\cup \{x\}$; 
7  while current $\ne \emptyset$ do 
8      for each $x_1 \in$ current do 
9          for each $x_2 \in$ previous $\cup$ current do 
10             for each d s.t. $z = x_1 \vee_d x_2$ is a valid fusion do 
11                 find $\mathcal{M}(z)$; 
12                 DM $\leftarrow$ extractMaximalDense($\mathcal{M}(z)$); 
13                 for each $x \in$ DM do 
14                     if $f(x) \ge \sigma$ and $x \notin$ previous $\cup$ current then next $\leftarrow$ next $\cup \{x\}$; 
15     previous $\leftarrow$ previous $\cup$ current; 
16     current $\leftarrow$ next; next $\leftarrow \emptyset$; 
17 return previous; 



In essence, Theorem 1 guarantees that we can find any maximal dense motif x either within $\mathcal{M}(b)$, for 
some maximal solid block b, or by d-fusing two higher-frequency maximal dense motifs $y_1, y_2$, for some d, 
finding $z = \mathcal{M}(y_1 \vee_d y_2)$ and then possibly "trimming" z on both sides to obtain x. 

Algorithm MADMX, whose pseudocode is reported in Figure 1, implements the strategy inspired by 
Theorem 1. It employs three (initially empty) sets previous, current, and next. In Line 2 the algorithm first 
stores the maximal solid blocks b in s for the given frequency in the set blocks (see Section 2). Then, it extracts 
all of the appropriate maximal dense motifs from $\mathcal{M}(b)$ in Lines 3-6 using the function extractMaximalDense, 
as implied by Theorem 1(a). Finally, Lines 7-16 implement the strategy as implied by Theorem 1(b). (In 
Line 10 a d-fusion $y_1 \vee_d y_2$ is considered valid if it satisfies the second property of Theorem 1(b).) 

An important issue for the efficiency of MADMX is that it needs to compute the exact frequency of each 
generated pattern. As far as the fusion of two patterns $x_1, x_2$ in Line 10 is concerned, observe that a 
simple computation on the pairs $(\ell_1, \ell_2) \in \mathcal{L}_{x_1} \times \mathcal{L}_{x_2}$ is sufficient to yield the frequencies of all the valid fusions 
of two patterns. However, given $z = x_1 \vee_d x_2$, for a maximal dense pattern w which does not contain z in its 
entirety, we can only conclude that $f(w) \ge f(z)$. We then label the motifs for which the exact frequencies 
are known as final, and those for which only a lower bound to their frequencies is known as tentative, and 
update the lower bounds and the labels during the execution of the algorithm. Whenever the set current 
contains no final motifs, we can label as final the tentative motif in current with the highest lower bound to 
its frequency, and continue with the generation. The proof of the correctness of this assumption and further 
details on the implementation of the algorithm will be provided in the full version of this extended abstract. 
A crude upper bound on the running time of madmx can be derived by observing that, for each pair of 
dense maximal motifs in output, the time spent during all the operations concerning that pair is (naively) 
O {n^) , where n is the length of the input string. If P patterns are produced in output, the overall time 
complexity is O (n^P^) . 
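
One possible reading of the computation on pairs of locations is sketched below in Python; it is an assumption about how such a computation could be organized, not the procedure implemented in MADMX. If $x_1$ occurs at position $\ell_1$ and $x_2$ at position $\ell_2$, then the fusion with $d = \ell_2 - \ell_1$ occurs at that alignment, so counting the pairs by their difference gives, for every d, the number of occurrences of the corresponding fusion. The function name and the toy location lists are hypothetical.

```python
from collections import Counter


def fusion_frequencies(locs_x1, locs_x2):
    """Map each offset d to the number of location pairs (l1, l2) with l2 - l1 = d."""
    return Counter(l2 - l1 for l1 in locs_x1 for l2 in locs_x2)


# Location lists of "A.B" and "B.C" in the toy string AdBeCfAgBhC
# ("." is an assumed stand-in for the don't care character).
freq = fusion_frequencies([0, 6], [2, 8])

sigma = 2  # hypothetical frequency threshold
frequent_offsets = sorted(d for d, count in freq.items() if count >= sigma)
print(freq)
print(frequent_offsets)
```

In this toy case only d = 2 reaches the threshold, which corresponds to the fusion "A.B.C" occurring twice. The sketch takes time proportional to the product of the two list sizes.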




Figure 1: Pseudocode of algorithm madmx. 
4 Experimental Validation of MADMX 



We developed a first, non-optimized implementation of MADMX in C++, which also includes an additional feature 
that eliminates, from the set of initial maximal solid blocks, those shorter than a given threshold $min_\ell$. The 
purpose of this heuristic is to speed up motif generation, driving it towards the discovery of (possibly) 
more significant motifs, with the exclusion of spurious, low-complexity ones. (The code is available for 
download at http://www.dei.unipd.it/wdyn/?IDsezione=4534.) 

We performed two classes of experiments to evaluate how significant the set of motifs found using 
our approach is. The first class of experiments, described in Section 4.1, compares our motifs with the known 
biological repetitions available in RepBase [7], a very popular genomic database. The second class of 
experiments, described in Section 4.2, aims at comparing the motifs extracted by MADMX with those extracted by 
VARUN using the same z-score metric employed in [2] for assessing their relative statistical significance. 

4.1 Evaluating significance by known biological repetitions 

RepBase [7] is one of the largest repositories of prototypic sequences representing repetitive DNA from different 
eukaryotic species, collected in several different ways. RepBase is used as a reference collection for masking 
and annotation of repetitive DNA through popular tools such as REPEATMASKER [15]. REPEATMASKER 
screens an input DNA sequence s for simple repeats and low complexity portions, and interspersed repeats 
using RepBase. Sequence comparisons are performed through Smith-Waterman scoring. REPEATMASKER 
returns a detailed annotation of the repeats occurring in s, and a modified version of s in which all of the 
annotated repeats are masked by a special symbol (N or X). With the current version of RepBase, on average, 
almost 50% of a human genomic DNA sequence will be masked by the program [15]. 

Most of the interspersed repeats found by REPEATMASKER belong to the families called SINE/ALU and 
LINE/L1: the former are Short INterspersed Elements that are repetitive in the DNA of eukaryotic genomes 
(the Alu family in the human genome); the latter are Long Interspersed Nucleotide Elements, which are 
typically highly repeated sequences of 6K-8K bps, containing RNA polymerase II promoters. The LINE/L1 
family forms about 15% of the human genome. 

We have conducted an experimental study using MADMX and REPEATMASKER on Human Glutamate 
Metabotropic Receptors HGMR 1 (410277 bps) and HGMR 5 (91243 bps) as input sequences. 
We have downloaded the sequences from the March 2006 release of the UCSC Genome database 
(http://genome.ucsc.edu). REPEATMASKER version was open-3.2.7, sensitive mode, with the query species 
assumed to be homologous; it ran using blastp version 2.0a19MP-WashU, and RepBase update 20090120. 

The experiments to assess the biological significance of the maximal dense motifs extracted by MADMX 
involved three separate stages. In the first stage, we ran REPEATMASKER on the input sequences HGMR 1 and 
HGMR 5, searching for interspersed repeats using RepBase. One of the output files (.out) of REPEATMASKER 
contains the list of found repeats, and provides, for each occurrence, the substring $s[i \ldots j]$ of the input 
sequence s which is locally aligned with (a substring of) the repeat. 

In the second stage, we ran MADMX on the same DNA sequences, with density threshold $\rho = 0.8$, frequency 
threshold $\sigma = 4$, and $min_\ell = 15$. In order to filter out simple repeats and low complexity portions, which 
are dealt with by REPEATMASKER without resorting to RepBase, we modified MADMX by eliminating periodic 
maximal solid blocks (with short periods), which are the seeds of simple repeats. Then, we identified the 
occurrences of the motifs returned by MADMX in the input sequences, using REPEATMASKER as a pattern 
matching tool (i.e., replacing RepBase with the set of motifs returned by MADMX as the database of known 
repeats). The underlying idea behind this use of REPEATMASKER was to employ the same local alignment 
algorithms, so as to make the comparison fairer. 

In the third stage, we cross-checked the intervals associated with the occurrences of the RepBase repeats 
against those associated with the occurrences of our motifs. Surprisingly, MADMX was able to identify and 
characterize all of the intervals of the known SINE/ALU repeats in HGMR 1 and HGMR 5 (respectively, 56 
repeats plus an extra unclassified for HGMR 1, and 20 plus an extra unclassified for HGMR 5). The remaining 
occurrences of the motifs made it possible to identify 29 repeats out of 78 of the LINE/L1 family in HGMR 1. (A 
more detailed account of the whole range of experiments conducted using REPEATMASKER and the data sets 
by Tompa et al. and Sandve et al. will be provided in the full version.) 

4.2 Evaluating significance by statistical z-score ranking 

The z-score is the measure of the distance in standard deviations of the outcome of a random variable from 
its expectation. Consider a DNA sequence s of length n as if it was generated by a stationary, i.i.d. source 
with equiprobable symbols; an approximation to the z-score for a motif of length m that contains c solid 
characters and appears f times in s is given by 

$$Z = \frac{f - (n - m + 1)\,p}{\sqrt{(n - m + 1)\,p\,(1 - p)}}, \qquad \text{where } p = (1/4)^{c}.$$

This metric was used in [2] to assess the significance of the motifs extracted by VARUN and to rank them in the output. 
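
The approximation is straightforward to evaluate; a minimal Python sketch is given below. The function name is hypothetical, and the `alphabet_size` parameter is an added convenience (the text fixes it to 4 for DNA).

```python
import math


def z_score(f, n, m, c, alphabet_size=4):
    """Approximate z-score of a motif of length m with c solid characters
    that appears f times in a sequence of length n (i.i.d. equiprobable model)."""
    p = (1.0 / alphabet_size) ** c   # probability of a match at a fixed position
    positions = n - m + 1            # number of candidate starting positions
    expected = positions * p
    std_dev = math.sqrt(positions * p * (1.0 - p))
    return (f - expected) / std_dev


# Toy numbers only: a motif of length 3 with 2 solid characters,
# seen twice in a sequence of length 11.
print(z_score(f=2, n=11, m=3, c=2))
```

For fixed n, m and f, adding solid characters decreases p and therefore raises the score, which is why denser motifs of comparable frequency tend to rank higher under this metric.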

We employed the code for VARUN provided by the authors to extract the rigid motifs from the DNA 
sequences analyzed in [2]. We then ran MADMX on the same sequences using the same frequency parameters, 
and setting the minimum density threshold $\rho$ in such a way as to obtain a comparable yet smaller output size. 
In this fashion, we tested the ability of MADMX to produce a succinct yet significant set of motifs, by virtue 
of its more flexible notion of density. 

The results are shown in Table 1. For VARUN we used D = 1, thus allowing at most one don't care 
between two solid characters, and ran MADMX with $min_\ell = 1$, so as to obtain the complete family of maximal 
dense motifs. The table has one row for each sequence (identified in the first column). 
Each sequence, whose total length is reported in the second column, is obtained as the concatenation of a 
number of smaller subsequences, reported in the third column. On each sequence, both tools were run with 
the same frequency threshold $\sigma$, and the table reports for both the output size in terms of the number of 
motifs returned and the execution time in seconds. Also, for MADMX, the table reports the density threshold 
$\rho$ used in each experiment. 











                               VARUN              MADMX                    best top-m z-scores
name      length  #  $\sigma$  |output|  time     $\rho$  |output|  time   m=10  m=50  m=100  m*    m
ace2      500     1  2         1866      3s       0.7     1762      18s    10    50    100    1571  1067
ap1       500     1  2         1555      1s       0.7     1304      5s     10    50    100    392   13
gal4      3000    6  4         9764      12s      0.67    7606      67s    10    49    99     16    16
gal4(*)   3000    6  4         9764      12s      0.65    11733     191s   10    50    100    9764  301
uasgaba   1000    2  2         4586      30s      0.70    4194      90s    10    50    100    175   175


Table 1: Results of the comparison with VARUN. 

For each experiment, we compared the best top-m z-scores, with m = 10, 50, and 100, as follows. Note 
that, in general, the top-m motifs found by MADMX and VARUN differ. Thus, we let $z_M^i$ (resp., $z_V^i$) be the 
z-score of the i-th motif in decreasing z-score order obtained by MADMX (resp., VARUN). For each m, the 
table reports how many times it was $z_M^i \ge z_V^i$, for $1 \le i \le m$. Also, column m* (resp., column m) gives the 
maximum m such that $z_M^i \ge z_V^i$ (resp., $z_M^i > z_V^i$) for every $1 \le i \le m$. 
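
The rank-by-rank comparison can be sketched in a few lines of Python. The function names are hypothetical and the score lists are made-up toy values, not experimental data; both lists are assumed to be sorted by decreasing z-score.

```python
def count_wins(z_first, z_second, m):
    """Number of ranks i <= m at which the first tool's i-th score is >= the second's."""
    return sum(1 for a, b in zip(z_first[:m], z_second[:m]) if a >= b)


def longest_winning_prefix(z_first, z_second, strict):
    """Largest m such that the first tool wins at every rank up to m
    (strictly if `strict` is True, with ties allowed otherwise)."""
    length = 0
    for a, b in zip(z_first, z_second):
        wins = a > b if strict else a >= b
        if not wins:
            break
        length += 1
    return length


# Made-up scores for illustration only.
z_a = [9.0, 7.0, 5.0, 2.0]
z_b = [8.0, 7.0, 6.0, 1.0]
print(count_wins(z_a, z_b, 4))
print(longest_winning_prefix(z_a, z_b, strict=False))
print(longest_winning_prefix(z_a, z_b, strict=True))
```

The prefix with ties allowed is never shorter than the strict one, so the two prefix columns of a comparison table can be sanity-checked against each other.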

Even when MADMX is calibrated to yield a slightly smaller output, the quality of the motifs extracted, 
as measured by the z-score, is higher than that of the motifs output by VARUN. Indeed, for sequences ace2 and uasgaba 
a very large prefix of the top-ranked motifs extracted by MADMX features strictly greater z-scores than the 
corresponding top-ranked ones extracted by VARUN. In fact, for all of the four sequences, at least the 
thirteen top-ranked motifs enjoy this property. To shed light on the slightly worse performance of MADMX 
on gal4, we re-ran MADMX with a different density threshold, so as to obtain a slightly larger output (see 
row gal4(*)). In this case, the top-301 motifs extracted by MADMX have z-score strictly greater than the 
corresponding motifs extracted by VARUN, while the execution time remains acceptable. 

For all runs, the top z-score of a motif discovered by MADMX is considerably higher than the one returned 
by VARUN. Specifically, on ace2 our best z-score is 387 763 vs. 12 027 of VARUN; on ap1, we have 12 027 
vs. 1490; on gal4 it is 75 vs. 28; on gal4(*) it is 150 vs. 28; on uasgaba we have 134532 vs. 67059. This 
reflects the high selectivity of MADMX, which is to be attributed mostly to the adoption of a more flexible density 
constraint. 
We must remark that MADMX (in its current nonoptimized version) is slower than VARUN, but it still runs 
in time acceptable from the point of view of a user. To further investigate the tradeoff between execution time 
and significance of the discovered motifs, we repeated the experiments running MADMX with $min_\ell = 2$ and 
$\rho = 0.65$, for all sequences. The running time of MADMX was almost halved, while the small output produced 
still featured high quality. In fact, for sequences ace2, ap1, and uasgaba the top-100 motifs extracted by 
MADMX have z-score greater than or equal to the corresponding ones returned by VARUN. 

We have also attempted a comparison between VARUN and MADMX on longer sequences (such as HGMR 1) 
at higher frequencies (since, unfortunately, VARUN does not seem to be able to handle low frequencies on 
very long sequences). Even allowing a higher number of don't cares between solid characters (D = 2) for the 
motifs of VARUN, all of the top-m z-scores featured by the motifs extracted by MADMX are greater than or 
equal to the corresponding scores in the ranking of VARUN, with m reaching the size of VARUN's output. In 
fairness, we remark that VARUN was designed to work at its best on protein sequences, while MADMX's main 
targets are DNA sequences. Hence, these two tools should be regarded as complementary. Moreover, VARUN 
has the advantage of retrieving flexible motifs, while MADMX focuses only on rigid ones. 

Acknowledgments The authors wish to thank Alberto Apostolico and Matteo Comin for providing the 
code and giving valuable insights on varun, Ben Raphael for suggesting the use of repeatmasker, and 
Roberta Mazzucco and Francesco Peruch for coding madmx. 